Inferring DNA sequences from mechanical unzipping: an ideal-case study 
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We introduce and test a method to predict the sequence of DNA molecules from in silico unzipping 
experiments. The method is based on Bayesian inference and on the Viterbi decoding algorithm. 
The probability of misprediction decreases exponentially with the number of unzippings, with a 
decay rate depending on the applied force and the sequence content. 
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DNA molecules are the support for the genetic infor- 
mation, and knowledge of their sequences is very im- 
portant from the biological and medical points of view. 
State-of-the-art DNA sequencing methods rely on bio- 
chemical and gel electrophoresis techniques 0, and are 
able to correctly predict about 99.9% of the bases. They 
were massively used over the past ten year to obtain the 
human genome (and the ones of other organisms). 

Nevertheless, the quest for alternative (cheaper and/or 
faster) sequencing methods is an active field of re- 
search. In this regard, recent single molecule micro- 
manipulations are of particular interest. Amongthem are 
DNA unzipping under a mechanical action 0,13, HI El 
or due to translocation through nanopores [7j , the obser- 
vation of the sequence-dependent activity of an exonu- 
clease the optical analysis of DNA polymerization 

in a nano-chip device ^} > the detection of single DNA 
hybridization Hereafter, we focus on mechanical 

unzipping (see Figure ^) , first realized by Bockelmann, 
Heslot and coworkers in 1997 @jI3- In their experiment, 
the strands are pulled apart under a constant velocity. 
The force is measured and fluctuates around 15 pN for 
the A-phage DNA (a 48, 502 base long virus), with higher 
(respectively, lower) values corresponding to the unzip- 
ping of GC (AT) rich regions. Researchers have also un- 
zipped RNA molecules 0, 0, 0, or DNA under a con- 
stant force (instead of velocity) |6|. Figure 0^ sketches a 
fixed-force output signal, with its pauses in the opening 
at sequence-specific positions. 

Various theoretical works have studied and repro- 
duced the unzip ping si gnal related to a given sequence 
HHHQmtllll3- Hereafter we address the inverse 
problem: given an unzipping signal (for example the one 
of Figure |2K) , can we predict the underlying sequence? 
We propose a Bayesian inference method to solve this 
problem 0], and test it in silico on the A-phage. We 
analytically study the dependence of the quality of the 
prediction on the sequence content, on the force, and on 
the number of unzippings. Finally we list the main ob- 
stacles to be circumvented prior to practical applications. 

Let S = {bi, &2j • • • ; bff} denote the sequence of N 
bases along the 5' — * 3' strand (the other strand is com- 
plementary). We model the unzipping of the molecule 
through the evolution of the number n of open base 
pairs [L3|; base pair opening (n — > n + 1) and closing 
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FIG. 1: An unzipping experiment. The extremities of the 
molecule are stretched apart under a force /. The fork at 
location n (nb. of open base pairs) moves backward or forward 
with rates (probability per unit of time) r c and r Q. 
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FIG. 2: Fixed-force unzipping of A-phage. A. number n of 
open base pairs vs. time t for forces / ranging from 15.5 to 17 
pN from model (0. B. magnification of the boxed region in A 
after a 90 degree clockwise rotation. C. free energy landscape 
g(n) versus n for the first 450 bases and / = 16 pN. Down and 
up arrows indicate, respectively, a local minimum in n = 50 
and two maxima in n — 232 and n = 327 (see the text). 



(n — > n — 1) happen with rates (Figure ^) 

r {n) = r exp{g («-)} , r c = r exp{a> ss } 



(1) 



go(n) is the binding energy of base pair (bp) n in units of 
kgT 19]; it depends on the base b n = A, T. G, or C and. 
due to stacking effects, on the nearest base 5ss is 



the work needed to stretch an open bp under a force / in 
units of kfjT ; according to the modified freely-jointed- 
chain model |12| . g ss = —2£/£q ln[sinh(a;)/x] where x = 
@o f /ksT, and £q = 15 A and I = 5.6 A are, respectively, 
the Kuhn and effective nucleotide lengths. Relation |JTJ 
implies that the opening rate at base n is a function of 
the sequence, r (n) = r n (b n ,b n ^ ), while the closing rate 
r c only depends on the force [20] . This a priori choice has 
been shown to reproduce quantitatively the behavior 
of unzipping experiments on short polynucleotides Q, 
with a typical frequency r ~ 10 6-7 sec -1 . 

Rates (y| define a one-dimensional biased random walk 
for the fork position (number of open bp) n(t) in the po- 

n 

tential g(n) = ng ss — ^""^ 9o(i)j that can be interpreted 

i=l 

as the free energy of the molecule when the first n bp are 
open. We show in Figure [5J3&C a typical time-trace of 
n(t) generated by Monte Carlo (MC) simulation for the 
A-phage sequence, together with the free energy land- 
scape g(n). Plateaus of n(t) coincide with deep local 
minima of g(n), where the fork remains trapped for a 
long time. As the force increases, opening becomes more 
favorable, and plateaus shrink. 

Our in silico time-traces are stochastic due to the ther- 
mal noise: two runs will give different traces. The prob- 
ability of a time-trace only depends on the set J\f — 
{t n , u n ,d n } of times t n spent on each base n, and of num- 
bers u n and d n of up (n — > n + 1) and down (n — > n — 1) 
transitions respectively. Given the sequence 5, this prob- 
ability reads 



P(JV|5)=cJ] M(b n ,b n+ 



(2) 



where c is a (sequence-independent) normal- 
ization constant and M(b n , b n +i;t n , u n , d n ) = 
r Q (b n ,b n+ i) Un r d c n exp{-(r (&„, b n+ i) + r c )t n }. Equa- 
tion J2J provides the solution of the direct problem: 
given the sequence S what is the distribution of the 
time-traces A/"? The inverse problem, that is the pre- 
diction of the sequence given some time-trace, can be 
addressed within the Bayesian inference framework. The 
probability that DNA sequence is S given an observed 
A/" is |3 



V(S\Af) 



7? (A/IS) VojS) 



(3) 



The value of S that maximizes this probability, <S*, is 
our prediction for the sequence. In the absence of any a 
priori information about the sequence, Vq(S) is the flat 
distribution, equal to 4~ N . The maximization of V(S\J\f) 
then reduces to that of V(Af\S) ©. 

In practice the most likel y se quence S* may be found 
using the Viterbi algorithm [2l| ■ The procedure is equiv- 
alent to a zero temperature transfer matrix technique 
exploiting the nearest-neighbor nature of couplings be- 
tween bases in ©. The probability P n for the base b n 



fulfills the recursive equation 
Pn+i(b n +i) cx max P n (b n ) M(b n ,b n+ i; 

). (4) 

where the proportionality constant is irrelevant for our 
purpose. The maximum in Q is reached for some base 
b™ ax (b n+ i) that depends on the next base b n+1 . Starting 
from P\{bi) = j, we obtain the probability Pn^n) for 
the last base of the sequence through iterations of 10} . 
Maximization of PnQjn) yields the most likely value for 
this last base, b^. The whole optimal sequence S* is then 
recursively obtained from the relation b^_ 1 = b^f(b^). 

We have tested our sequencing method on the A-phage. 
First we build a dynamical process on the sequence S x 
of the phage with rates |T|I. and generate an unzipping 
trace N by a MC procedure. Then we use the Viterbi 
procedure (which ignores the phage sequence) to make a 
prediction for the sequence, S* , from this signal M . We 
estimate the error over the prediction about base n from 
the failure rate 



e„ = Probability [6* ^ 6*] 



(5) 



where the probability is computed by repeating the pro- 
cedure over different MC runs. The errors e„ are shown 
in Figure (with the continuous curve) for the first 450 
bases at a force of 16 pN. Values range from (per- 
fect prediction) to 0.75 (random guess of one among 
four bases). A comparison with the free energy g(n) 
(Figure shows that e n is small in the flattest part 
of the landscape (350 < n < 450), or in local minima 
e.g. the n — 50 base preceded by 4 weak bases and 
followed by 4 strong bases (...TTTA-A-GGCG...). Con- 
versely, bases that are not well determined correspond to 
local maxima of the landscape e.g. n — 327, 328 bases 
between 7 strong and 7 weak bases (...GCCGCCG-TC- 
ATAAAAT...). We plot the average fraction of mispre- 
dicted bases, e = — ^^e n , in Figure As shown in 

n 

Fig. 21 for a larger force, there are more open bases (about 
60, 600 and 5000 at 15.5, 16 and 17 pN in about 100 sec- 
onds), but the time spent on each base is smaller, and 
therefore e is larger (e = 20%, 23%, 47%). Most errors 
are due to the difficulty of distinguishing A from T, and 
G from C. The probability that a weak (A or T) base 
is confused with a strong one (G or C), or vice- versa, is 
plotted in Figure ^3. 

Performances can be greatly improved by collecting 
information from multiple unzippings. As the number 
of passages over the same base n gets larger, the total 
waiting times t n and transition parameters u n , d n become 
less affected by fluctuations, and reflect more faithfully 
the thermodynamic signature of the base. In practice, we 
look for the most likely sequence S* given R unzipping 
signals J\f\, A/a, • • ■ , Mr. FiguresEK and^shows the drop 
down in the probability of error when the number R of 
unzippings increases. Observe from Figure [SJ'V that the 
decay of e„ with R (J5J varies from base to base. The 
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FIG. 3: A. Probability e n of an error (top) and entropy a„ 
(middle) versus the base index n, for the first 450 bp of DNA 
A-phage at / = 16 pN. Full lines correspond to R — 1 unzip- 
ping, dotted lines to R = 40. B. Theoretical values for the 
decay constants R^ in e n Q. For instance, base 232 (arrow) 
is characterized by -R232 — 10, and is not (respectively, well) 
predicted with R = 1 (resp. R = 40) unzippings. 
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FIG. 4: A. Fraction e of mispredicted bases for the A-phage 
versus the number R of unzippings, averaged over 1000 sam- 
ples of R unzippings, and for forces of 15.5, 16 and 17 pN 
(from bottom to top). B. Same as A, but we only discrimi- 
nate among weak and strong basis. 



decrease of the total error e is much faster for AT vs. GC 
(Figure 0)3) than for complete (Figure ^}Y) recognition. 

It is useful to build indicators of performances that 
do not rely on the exact knowledge of the unzipped se- 
quence (used here for checking the quality of our results 
but unknown in practical applications). To this aim, we 
calculate the optimal sequences S£ when base n is con- 
strained to value 6, and the corresponding probabilities 
P*(6). We then define the Shannon entropy 



E 

b=A,T,G,C 



(P*(b) log 4 P„*(6)), 



(6) 



where (■) denotes the average over MC data. a n is low 
when one of the four bases has much higher probability 
than the other ones and close to unity for uncertain pre- 
dictions (equiprobable bases). Figure shows that a n 
and e n as a function of the base index n are indeed very 
similar: the Shannon entropy is a good indicator of the 
success of our reconstruction. 



Our analytical study of the dependence of the quality 
of the prediction upon the force, the sequence content, 
and the number of unzippings confirms that the proba- 
bility of error e ra decreases very quickly with R, 



(7) 



As / decreases to its critical value (below which the 
molecule cannot open), the decay constant R° decreases 
to zero, and predictions drastically improve at fixed R. 
Our theoretical values for R c n are shown in Figure [3)3 for 
/ = 16 pN, and vary from 0.1 to 45 with the base index 
n. The agreement with the decay of e„ from R = 1 to 40 
unzippings (Figure |3J\) is excellent. Note that e in Fig- 
ure 21 is not a pure exponential, but a superposition of 
exponentials with n-dependent decay constants R^. We 
now present the calculation of R c n in three steps. 

(a) Pairing only, high force. Assume first that there 
are only 2 and not 4 bp- types, called + and — , and no 
stacking interaction. Call A the difference between the 
(pairing) free-energies of + and — bp, and (t±) the av- 
erage time spent by the fork on a ± bp before moving 
forward or backward. Consider now a bp of type b and 
call i the time spent on this bp divided by the number 
R of unzippings. From the central limit theorem, for 
large R, t gets narrowly peaked around its mean value 
(tb), with Gaussian fluctuations St ~ R~?. Bayes pre- 
diction (PJ) will be erroneous, b* = —b, when t is closer 
to (£_{,) than to its expected value (if,). The probabil- 
ity of error is thus given by the Gaussian tail, and scales 
as e ~ exp(— <5i~ 2 ), hence (JTJl. A careful calculation [2(j 
gives the precise value of the decay constant in JJJ), 
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(8) 



Good predictions are obtained when the molecule is un- 
zipped a few R c times (for example R ~ 4i? c gives 
e ~ 2%). To distinguish weak (AT) from strong (CG) 
bp only we have A ~ 2.8 [3 and R c ~ 1 (Figure HP), 
while complete recognition corresponds to A ~ 0.5 and 
R c ~ 30 (Figure HV). 

(b) Pairing and Stacking, high force. In presence of 
stacking interactions, the error ej, on base b depends on 
the neighboring bases, say, x and y. At large R, errors are 
rare and are typically due to a single base mis-prediction 
e.g. b — > b' . The probability £{,_>;,' of this mistake is the 
product of the probabilities e x b-*xV and eby~*b'y of the 
two bond violations. We estimate e x b~, x b' ~ e^ R ^ R xb^xV 
from 10 where R xb ^ x b' 1S gi ven by © with A = g^ b — 
gQ b . A similar expression is readily obtained for the by 
bond 

calculate Cb 
b\ 



Knowing the asymptotic behavior of et, , we 

by selecting the worst value for 
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The above derivation is confirmed by exact calculations 
based on techniques for ID disordered systems [20ll22|. 



(c) Moderate force. The above calculations are correct 
for high forces. At moderate forces, bp can close and are 
visited several times by the fork. The effective number of 
unzippings is R x (u n ), where (u n ) is the average number 
of openings of bp n during a single unzipping. The decay 
constant is thus, from 

RZ=m n _ lKbn+1 /{* n ) ■ (io) 

As the force is lowered, (u n ) increases (from 1 at high 
force), and ¥L c n diminishes. To calculate (u n ), we con- 
sider the ID transient random walk defined by the prob- 
abilities q m = r c /(r (m) + r c ) and 1 — q m for closing or 

(n) 

opening bp m. Let p m be the probability that the fork 
will never reach position n starting from m(> n). The 
ratios p^} = Pm V^m+i fulfill the Riccati recursion rela- 
tion [23 Pm+l = Qm+i)/(l - qm+i/3m ) ). Iterating 

(n) 

with boundary condition p n — allows us to obtain 

(tin) = l/p [ n +1) = H m>n Pfn- 

Finally we discuss the difficulties hindering a direct 
application of our inference method to real data (see also 
[23), and possible way-outs. 

First, temporal resolution is limited in practice. The 
frequency bandwidth is controlled by the viscous friction 
and the stiffness of the setup, with a typical value of 10 
kHz |3|,|2J]. The corresponding time, St ~ 100 /isec, is 
about 10 (resp. 200) times longer than the typical open- 
ing time for GC (resp. AT) bp. As a result, the fork 
can move by D(> 1) bp during the time interval St. We 
have taken into account such moves by considering inter- 
actions between bases at distance < D in the probability 
P(J\f\S), and modified the reconstruction procedure ac- 
cordingly (the transfer matrix has now dimension 4 D ) 
[20|. In practice, when St — 1 /isec, sequences cannot 
be predicted with the usual D = 1 reconstruction pro- 



cedure, but are correctly inferred with the D = 6 proce- 
dure. Though time resolution is currently far below this 
limit, future experimental progresses, and new technolo- 
gies e.g. combination of optical trap and single-molecule 
fluorescence [2(|, could help bridging the gap. 

Secondly, thermal fluctuations of the open strands lead 
to an uncertainty Sn over the position n of the fork 1231 
e.g. Sn ~ 5 for / ~ 15 pN and n = 300 open bp jl2| . 
The presence of correlations between bases at distance 
D < Sn does not affect the result {7J) for e„ as long as 
the relaxation time of the strands is smaller than the bp 
opening time i.e. up to a few hundreds open bp. What 
happens for larger values of n is currently under study. 

Thirdly, we have assumed so far to have a perfect 
knowledge of the dynamics of unzipping. In practice, any 
functional form for V(Af\S) will be only approximate for 
a given experimental setup. A possible way-out based on 
a learning principle is the following: in a first stage unzip- 
ping data corresponding to a known sequence (A-phage) 
are collected to caliber V, in a second stage predictions 
are made for new sequences. 

Last of all, our study of fixed-force unzipping shows 
that bases located in local minima of the free-energy 
landscape are well predicted, while maxima are much 
harder to predict. Accuracy could be greatly improved 
through an adequate force vs. time scheme capable of 
bringing the fork in the right place and making it spend 
time there. Investigation of the fixed- velocity case, where 
the force signal is remarkably affected by single base mu- 
tation [3j, will be very interesting. 

In conclusion, we hope the present study will motivate 
further work to assess and improve the performances of 
unzipping-based sequencing. 
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